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Abstract — Often adaptive, distributed control can be viewed 
as an iterated game between independent players. The cou- 
pling between the players’ mixed strategies, arising as the 
system evolves from one instant to the next, is determined 
by the system designer. Information theory tells us that the 
most likely joint strategy of the players, given a value of the 
expectation of the overall control objective function, is the 
minimizer of a Lagrangian function of the joint strategy. So 
the goal of the system designer is to speed evolution of the 
joint strategy to that Lagrangian minimizing point, lower the 
expectated value of the control objective function, and repeat. 
Here we elaborate the theory of algorithms that do this using 
local descent procedures, and that thereby achieve ef£cient, 
adaptive, distributed control. 

I. INTRODUCTION 

This paper considers the problem of adaptive distributed 
control [1], [2], [3]. Typically in such problems, at each 
time i each control agent i sets its state x\ independently 
of the other agents, by sampling an associated distribution, 
qKxf). Rather than directly via statistical dependencies of 
the agents’ states at the same time f, the coupling between 
the agents arises indirectly, through the stochastic joint 
evolution of their distributions {q\} across time. 

More formally, let time be discrete, where at the begin- 
ning of each t eveiy agent sets its state (“makes its move”), 
and then the rest of the system responds. Indicate the state of 
the entire system at time t as z l . (z l includes x t 9 as well as 
all stochastic elements not being directly controlled.) So the 
joint distribution of the moves of the agents at any moment 
t is given by the product distribution q t {x t ) = YU ( x i)> 
and the state of the entire system, given joint move x f , is 
governed by P{z l \ x f ). 

Now in general the observations by agent i of aspects 
of the system’s state at times previous to t will determine 
q\. So q\ is statistically dependent on the previous states of 
the entire system, In other words, the agents can 

be viewed as players in a repeated game with Nature, each 
playing mixed strategies {q\} at moment t [4], [5], [6], [7], 
[8]. Their interdependence arises through information sets 
and the like, in the usual way. 

From this perspective what the designer of a distributed 
control system can specify is the stochastic laws governing 
the updating of the joint strategy. In other words, the 
designer wishes to impose a stochastic dynamics on a Multi- 
Agent System (MAS) that optimizes an overall objective 
function of the state of the system in which the MAS is 
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embedded, F{z)} Formally, this means inducing a joint 
strategy q(x) with a good associated value of E q (F ) = 
/ dxq(x)E(F | x) = f dxq(x)G(x). 2 Once such a q is 
found, one can sample it to get a £nal x, and be assured 
that, on average, the associated F value is low. G is called 
the world utility. 

In this paper we elaborate a set of algorithms that itera- 
tively update q l in such a manner. The algorithms presented 
here are based on using steepest descent techniques to 
minimize a G-parameterized Lagrangian, La(q)? Because 
the descent is over Euclidean vectors q , these algorithms 
can be applied whether the x* are categorical, continuous, 
time-extended, or a mixture of the three. So in particular, 
they provide a principled way to do “gradient descent over 
categorical variables”. 

In the next section we £rst derive the Lagrangian Lciq) 
aiiu discuss some of its properties. 

In the following section we show how to apply gradient 
descent (and its embellishments) to optimization of the 
Lagrangian. If we view the agents as engaged in a team 
game, all having the same utility G, then this gradient 
descent is a distributed scheme for each agent to update 
its strategy, in a way that will steer the game to a bounded 
rational equilibrium [14], [9]. 

In this section we also consider second order methods. 
In contrast to gradient descent, in general any single appli- 
cation of Newton’s method to update a product distribution 
q will result in a new distribution p(q) that is not a product 
distribution. So we must instead solve for the product 
distribution q r (p ) having minimal Kullback-Leibler distance 
to p. In this section we derive the rule for iterative updating 
of our distribution so as to move q in the direction of 

In practice any local descent scheme often requires 
Monte Carlo sampling to estimate terms in the gradient 
To minimize the expected quadratic error of the estimation, 
typically the game is changed from being a team game. In 
other words, in general changing the agent’s utilities g t to 
not all equal G will result in lower bias plus variance of 
the estimation of the gradient, and therefore will speed evo- 
lution to a good joint strategy. These and other techniques 

] Here we follow the convention that lower F is better. In addition, for 
simplicity we only consider objectives that depend on the state of the 
system at a single instant; it is straightforward to relax this restriction. 

2 For simplicity, here we indicate integrals of any sort, including point 
sums for countable x, with the f symbol. 

3 See [9], [10], [11], [12] for non-local techniques for £nding g t 1 
techniques that are related to frticious play, and see [13] for techniques 
that exploit the Metropolis-Hastings algorithm. Other non-local techniques 
are related to importance sampling of integrals, and are brieGy mentioned 
in [14]. 



for shrinking bias plus variance are discussed in [ 11 ], [ 10 ]. 

We end this section by mentioning some other techniques 
for improving the Monte Carlo sampling. These include 
data-aging, and techniques for managing the descent when 
it gets close to a border of the (simplex of) allowed q , Q. 
Most of these techniques introduced can be used even with 
schemes for minimizing La(q) other than gradient descent. 

In the £nal section we introduce some alternatives to 
L G (q), designed to help speed convergence to a q with low 
E(G). Miscellaneous proofs can be found in the appendix. 

The general mathematical framework for casting control 
and optimization problems in terms of minimizing Lagan- 
gians of probability distributions is known as the theory 
of probability Lagangians. The precise version where the 
probability distributions are product distributions is known 
as “Product Distribution” (PD) theory. [11]. It has many 
deep connections to other £elds, including bounded rational 
game theory and statistical physics [14]. As such it serves as 
a mathematical bridge connecting these disciplines. Some 
initial experimental results concerning the use of PD theory 
for distributed optimization and distributed control can be 
found in [15], [16], [17], [18], [19], [20]. See [13], [10], 

[21] for other uses and extensions of PD theory. 

II. Product Distribution Lagrangians 
A. The maxent Lagrangian 

Say the designer stipulates a particular desired value 
of E(G ), 7 . For simplicity, consider the case where the 
designer has no other knowledge concerning the system 
besides 7 and the fact that the joint strategy is a product 
distribution. Then information theory tells us that the a 
priori most likely q consistent with that information is 
the one that maximizes entropy subject to that information 

[22] , [23], [24]. 4 In other words, of all distributions that 
agree with the designer’s information, that distribution is 
the “easiest” one to induce by random search. 

Given this, one can view the job of the designer of 
a distributed control system as an iterative equilibration 
process. In the £rst stage of each iteration the designer 
works to speed evolution of the joint strategy to the q with 
maximal entropy subject to a particular value of 7 . Once we 
have found such a solution we can replace the constraint — 
replace the target value of E(G ) — with a more dif£cult 
one, and then repeat the process, with another evolution of 
Q [ 11 ]* 

De£ne the maxent Lagrangian by 

L(q) = (3(E q (G) - 7 ) - 5 

= J dxq(x)G(x ) - 7 ) - S{q), (1) 

4 In light of how limited the information is here, the algorithms presented 

below are best-suited to “off the shelf’ uses; incorporating more prior 
knowledge allows the algorithms to be tailored more tightly to a particular 
application. 


where S(q ) is the Shannon entropy of < 7 , — f dxq( rr)ln^|||, 
and for simplicity we here take the prior p to be uniform. 5 . 
Writing it out, for a given 7 , the associated most likely 
joint strategy is given by the q that minimizes L(q) over 
all those (q,0) such that the Lagrange parameter 0 is at a 
critical point of L, i.e., such that = 0 . 

Solving, we £nd that the qi are related to each other via 
a set of coupled Boltzmann equations (one for each agent 
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where the overall proportionality constant for each i is set 
by normalization, the subscript g^ on the expectation value 
indicates that it is evaluated according to the distribution 
n jVi Oj, and 3 is set to enforce the condition E q p(G) = 7 . 
Following Nash, we can use Brouwer’s £xed point theorem 
to establish that for any £xed 0, there must exist at least 
one solution to this set of simultaneous equations. 

If we evaluate E(G) at the solution q@, we £nd that it 
is a declining function of 0. So in following the iterative 
procedure of equilibrating and then lowering 7 we we will 
raise 0, Accordingly, we can avoid the steps of testing 
whether each successive constraint E{G) = 7 is met, and 
simply monotonically increase 0 instead. This allows us to 
avoid ever explicitly specifying the values of 7 . 

Simulated annealing is an example of doing this, where 
rather than work directly with g, one works with random 
samples of it formed via the Metropolis random walk 
algorithm [25], [26], [27], [28]. There is no a priori reason 
to use such an inef£cient means of manipulating q however. 
Here we will work with q directly instead. This will result 
in an algorithm that is not simply “probabilistic” in the 
sense that the updating of its variables is stochastic (as in 
simulated annealing). Rather the very entity being updated 
is a probability distribution. 

B. Shape of the maxent Lagrangian 

Consider L as a function of g, with 0 and 7 both treated 
as £xed parameters. (So in particular, E q (g) need not equal 
7 .) First, say that q ^ is also held £xed, with only qi allowed 
to vary. This makes E(g ) linear in g t . In addition, entropy 
is a concave function, and the unit simplex is a convex 
region. Accordingly, the Lagrangian of Eq. 1 has a unique 
local minimum over g^. So there is no issue of choosing 
among multiple minima when all of q (*) is £xed. Nor is 
there any problem of “getting trapped in a local minimum” 
in a computational search for that minimum. Indeed, in this 
situation we can just jump directly to that global optimum, 
via Eq. 2. 

Now introduce the shorthand for any function U(x) i 
[U]i,p{xi) = dx(i} U(xi, £(q )p(x(^) | xf). 

5 Throughout this paper the terms in any Lagrangian that restrict distri- 
butions to the unit simplices are implicit. The other constraint needed for 
a Euclidean vector to be a valid probability distribution is that none of its 
components are negative. This will not need to be explicitly enforced in 
the Lagrangian here. 



So [G] ij9(i) (xi) is agent z’s “effective” cost function, 
E q(i) (G | Xi). Consider the value This is 

the value of E(G) at V s bounded rational equilibrium for 
the £xed g^), i.e., it is the value at the minimum over g* 
of L . View that value as a function of (3. One can show 
that this is a decreasing function. In fact, its derivative just 
equals the negative of the variance of [G]^ q(i) ( Xi ) evaluated 
under distribution qf(x t ) (see appendix). Combining this 
with the fact that E(G) is bounded below (for bounded 
G), establishes that the variance must go to zero for large 
enough /?. So as 0 grows, gf (xi) — + 0 for all Xi that don’t 
minimize E q(i) (G | x^. In other words, in that limit, g £ 
becomes Nash-optimal. 

Next consider varying over all q € Q, the space of all 
product distributions g. This is a convex space; if p G Q and 
p f e Q , then so is any distribution on the line connecting 
p and p '. However over this space, the E(G) term in L is 
multilinear. So L is not a simple convex function of q. So 
we do not have guarantees of a single local minimum. 

The following lemma extends the technique of Lagrange 
parameters to off-equilibrium points: 

Lemma 1: Consider the set of all vectors leading from 
x' £ M n that are, to £rst order, consistent with a set of 
constraints over R n , { f t (x ) = 0}. Of those vectors, the 
one giving the steepest ascent of a function V(x) is u = 
W -b J2i ^ V/*, up to an overall proportionality constant, 
where the A t enforce the £rst order consistency conditions. 

Now examine the derivatives of S(q) with respect to all 
components of q , i.e., the q-gradient of the entropy. At 
the border of Q , at least one of the ln(g;) terms in those 
derivatives will be negative in£nite. Combined with Lemma 
1, this can be used to establish that at the edge of Q, the 
steepest descent direction of any player’s Lagrangian points 
into the interior of Q (assuming £nite /3 and {G}). (This is 
resected in the equilibrium solutions Eq. 2.) Accordingly, 
whereas Nash equilibria can be on the edge of Q (e.g., 
for a pure strategy Nash equilibrium), in bounded rational 
games any equilibrium must lie in the interior of Q . In 
other words, any equilibrium (i.e., any local minimum) of a 
bounded rational game has non-zero probability for all joint 
moves. So just as when only varying a single g t , we never 
have to consider extremal mixed strategies in searching for 
equilibria over all Q. We can use local descent schemes 
instead [15], [19], [29]. 

Lemma 1 can also be used to construct G with more 
than one solution to Eq. 2. One can also show that for 
every player i and any point q interior to Q, there are 
directions in Q along which V s Lagrangian is locally 
convex. Accordingly, no player’s Lagrangian has a local 
maximum interior to Q. So if there are multiple local 
minima of z’s Lagrangian, they are separated by saddle 
points across ridges. In addition, the uniform q is a solution 
to the set of coupled equations Eq. 2, but typically is not a 


local minimum, and therefore must be a saddle point. 

Say that we were not restricting ourselves to product dis- 
tributions. So the Lagrangian becomes L{p) = (3(E P (G) - 
7 )— 5(p), where p can now be any distribution over x. There 
is only one local minimum over p of this Lagrangian, the 
canonical ensemble: 

P^(x) OC e -P G ( x ) 

In general pP is not a product distribution. However we can 
ask what product distribution is closest to it 

Now in general, the proper way to approximate a target 
distribution p with a distribution from a subset C of the set 
of all distributions is to £rst specify a mis£t measure saying 
how well each member of C approximates p, and then solve 
for the member with the smallest mis£t. This is just as true 
when C is the set of all product distributions as when it is 
any other set. 

How best to measure distances between probability distri- 
butions is a topic of ongoing controversy and research [30]. 
Hie most common way to do so is with the in£nite limi t 
log likelihood of data being generated by one distribution 
but misattributed to have come from the other. This is know 
as the Kuilback-Leibler distance [22], [31], [23]: 

KL(pi\\p 2 ) = S{p 1 \\p 2 )~S{p 1 ) (3) 

where S(p x ||p 2 ) = - f dx Pl (x)ln[^] is known as the 
cross entropy from p 2 to p 2 (and as usual we implicitly 
choose uniform p). The KL distance is always non-negative, 
and equals zero iff its two arguments are identical. 

As shorthand, de£ne the “pg distance” as KL(p || g), 
and the “gp distance” as KL(q || p), where p is our 
target distribution and q is a product distribution. Then it 
is straightforward to show that the qp distance from q to 
target distribution p 13 is just the maxent Lagrangian L(g), 
up to irrelevant overall constants. In other words, the g 
minimizing the maxent Lagrangian is q with the minimal 
qp distance to the associated canonical ensemble. 

However the qp distance is the (in£nite limit of the 
negative log of) the likelihood that distribution p would 
attribute to data generated by distribution g. It can be argued 
that a better measure of how well q approximates p would 
be based on the likelihood that q attributes to data generated 
by p. This is the pq distance; it gives a different Lagrangian 
from that of Eq. 1. 

Evaluating, up to an overall additive constant (of the 
canonical distribution’s entropy), the pq distance is 

KL{p || q) = ~YlJ (**)]• 

This is equivalent to a game where each coordinate i has 
the “Lagrangian” 

L*(q ) = - J ctei Pi(x f )ln[^(i)], (4) 

where Pi(xi) is the marginal distribution f dx^p(x). The 
minimizer of this is just g^ = p* Vz, i.e., each g* is set to 
the associated marginal distribution of p. 


In the interests of space, the rest of this paper we restrict 
attention to the pq KL distance and associated maxent 
Lagrangian. 


III. Descent of the maxent Lagrangian 
A. Gradient descent 


Consider the situation where each X* can take on a 
£nite number of possible values, |£*|. Say we are iteratively 
evolving q to minimize L for some £xed /3, and are currently 
at some point q € Q. Using Lemma 1, we can evaluate the 
direction from q within Q that, to £rst order, will result in 
the largest drop in the value of L(q): 


om 

dQi{Xi — j) 


^CO-y^x')/ ]&|, (5) 


where Ui(j) = 0E(G \ x t = j) + ln^fj)]. (Intuitively, 
the reason for subtracting ;U*(x£)/|£i| is to keep the 
distribution in the set of all possible probability distributions 
over x , V .) 

Eq. 5 speci£es the change that each agent should make 
to its distribution to have them jointly implement a step in 
steepest descent of the maxent Lagrangian. These updates 
are completely distributed, in the sense that each agent’s 
update at time t is independent of any other agents’ update 
at that time. Typically at any t each agent i knows q^t) 
exactly, and therefore knows ln[<p(j)]. However often it will 
not know G and/or the q (i). In such cases it will not be able 
to evaluate the E(G \ Xi — j) terms in Eq. 5 in closed form. 

One way to circumvent this problem is to have those 
expectation values be simultaneously estimated by all agents 
by repeated Monte Carlo sampling of q to produce a set of 
(x, G(x)) pairs. Those pairs can then be used by each agent 
i to estimate the values E(G j X{ = j), and therefore how 
it should update its distribution. In the simplest version of 
such an update to q only occurs once every L time-steps. 
In this scheme only the samples (x, G(x)) formed within a 
block of L successive time-steps are used at the end of that 
block by the agents to update their distributions (according 
to Eq. 5). 


B. Higher order descent schemes 

In general, second order descent (e.g., Newton’s method) 
of the maxent Lagrangian is non-trivia!, due to coupling that 
arises between the agents and the requirement for associated 
matrix inversion. However recall that one way to motivate 
the entropic product distribution Lagrangian L(q) starts 
by saying that what we really want is the fully coupled 
canonical ensemble distribution, p^(x) oc exp(—/3G(x)). 
L{q) then measures qp KL-distance to that desired distri- 
bution. From this perspective, any given iteration of second 
order descent of the maxent Lagrangian runs downhill on a 
quadratic approximation to a distribution, a distribution that 
is itself a product distribution approximation to the ultimate 
distribution we want to minimize. 

This suggests the alternative of making the approxima- 
tions in the opposite order. In this approacfi we £rst make 


a quadratic approximation (over the space of all p, not 
just all q) to the maxent Lagrangian, L(p). Via Newton’s 
method this speci£es a p* that minimizes that quadratic 
approximation. We can then £nd the product distribution 
that is nearest (in pq KL distance) to p*. This scheme is 
called Nearest Newton descent. 

The gradient and Hessian of L(p) are given by 


8L 

dp(x) p=p ° 
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dp(x)dp(x') p=p ° 


fiG(x) + l-j-ln(p°(x)) 
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where p° is the current point. This Hessian is positive- 
de£nite (given that the current p is a member of V ). By 
simple Lagrange parameters, the general solution is either 
on the border of V, or if in the interior is given by 

p*(x) = -p°(x) {[3G(x) + ln(p(x)) + A] 


where A is set by normalization. Solving, either p* is on 
the edge of the simplex, or 


p°(x) 


= 1 - S(p°) - In(p°(x)) - (3[G(x) - E(G)}. 


Note that the right-hand side is exactly the direction 
you should go using (simplex-constrained) gradient descent 
of L(p). So the direction to p* from p° is given by the 
Hadamard product of p° and the direction given by gradient 
descent. 

Now we can approximate p* with the product distribution 
having the minimal KL distance to it. In particular, consider 
using pq KL distance rather than qp KL distance. Recall 
that for this kind of KL distance, the optimal product 
distribution approximation to a joint distribution is given 
by the product of the marginals of that joint distribution 
(see the discussion just below Eq. 4). Say that p° is in the 
form of a product distribution, g°, i.e., that we are starting 
from a product distribution. Then calculating the marginals 
of the associated p* to get q* is trivial: 


|M = 1 _ S( 9 °)-ln(<j?(/)) 

- P[E(G \ ^ = j) - E(G)} (6) 

Note that since the original quadratic approximation was 
over the full joint space, this formula automatically takes 
into account inter-agent couplings. In practice of course, it 
may make sense not to jump all the way from q° to q*, 
but only part-way there, to be conservative. (In fact, if q* 
isn’t in the interior of the simplex, such partial jumping is 
necessary.) One potential guide to how far to jump is the 
pq KL distance from p* to Yli Q*- Unlike the KL distances 



to the full joint Boltzmann distribution, we can readily 
calculate this KL distance. 

The conditional expectations in Nearest Newton are the 
same as those in gradient descent Accordingly, they too can 
be estimated via Monte Carlo sampling, if need be. It’s also 
worth noting that Eq. 6 has the same form as one would 
get by evaluating the Hessian of the maxent Lagrangian, so 
long as one ignored inter-agent aspects of that Hessian. 


C. Practical issues 


In practice, the block-wise Monte Carlo sampling to 
estimate descent directions described above can be pro- 
hibitively slow. The estimates typically have high variance, 
and therefore require large block size L to get a good 
descent direction. One set of ways to address this is to 
replace the team game with a non-team game, i.e., for each 
agent i have it estimate quantities E(g z | = j ) rather than 

E(G | Xi = j ), where each private utility g t is chosen to 
have much lower variance than G {11], [15], [10]. 6 7 

Another useful technique is to allow samples from pre- 
ceding blocks to be re-used. One does this by £rst “aging” 
that data to reject the fact that it was formed under a 
different . For example, one can replace the empirical 
average for the most recent block fc. 


Gij(k) = 
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with a weighted average over the expected G’s of all 
preceding blocks. 


V G. 
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for some appropriate aging constant k? 

Typically such ageing allows L to be vastly reduced, 
and therefore the overall minimization of L to be greatly 
sped up. For such small L though, it may be that the most 
recent block has no samples of some move ~ j . This 
would mean that Gij(k) is undefined. One crude way to 
avoid such problems is to simply force a set of samples 
of each such move if they don’t occur of their own accord, 
being careful to have the x ^ formed by sampling when 
forming those forced samples. 

Other useful techniques allow one to properly decrease 
the step size as one nears the border of Q. 


IV. Other Lagrangians For Finding Minima Of G 

There are many alternative Lagrangians to the ones 
described above. The section focuses on such alternative 
Lagrangians for the purpose of Ending argmin x G{x). Two 

6 Formally, this means that each agent t has a separate Lagrangian, 
formed from Eq. 1 by substituting g x for G. The associaed joint solution 
q is then given by substituting the appropriate gi for G in each instance 
of the coupled equations Eq. 2 (one instance for each i). See [14] for the 
relation of this to bounded rational game theory. 

7 Not all preceding Gij (m) need to be stored to implement this; 
exponential ageing can be done online using 3 variables per (i, j) pair. 


classes of such Lagrangians are investigated: variants of the 
Maxent Lagrangians, and variants of the two types of KL- 
distance Lagrangians. 

A. Maxent Lagrangians 

Say that after Ending the q that minimizes the Lagrangian, 
we HD sample that q, K times. We then take the sample 
that has the smallest G value as our guess for the x that 
minimizes G{x). For this to give a low x we don’t need 
the mean of the distribution q(G) to be low — what we 
need is that the bottom tail of that distribution is low. This 
suggests that in the E(G) term of the Maxent Lagrangian 
we replace 

?(*) -*■ 

■j(j) e ( K ~ IE. 

K,> 

where © is the Heaviside theta function. This new multiplier 
of G is still a probability distribution over x . It equals 
0 if G{x) is in the worst 1 — k percentile (according to 
distribution q) of G values, and otherwise. So under 
this replacement the E(G) term in the Lagrangian equals 
the average of G restricted to that lower /c’th percentile. For 
k — AT" 1 , our new Lagrangian forces attention in setting q 
on that outlier likely to come out of the if -fold sampling 
of q{G). 

One can use gradient descent and Monte Carlo sampling 
to minimize this Lagrangian, in the usual way. Note that 
the Monte Carlo process includes sampling the probability 
distribution ^ )j] ^ we j] ^ q. 

This means that only those points in the best rc’th percentile 
are kept, and used for all Monte Carlo estimates. This may 
cause greater noise in the Monte Carlo sampling than would 
be the case for k— 1. 

As an example, say that for agent i, all of its moves have 
the same value of E(G | x z ), and similarly for agent j, and 
say that G is optimal if agents i and j both make move 
0. Then if we modify the updating so that agent i only 
considers the best values that arose when it made move 0, 
and similarly for agent j, then both will be steered to prefer 
to make move 0 to their alternatives. This will cause them 
to coordinate their moves in an optimal manner. 

A similar modifeation is to replace G with /(G) in the 
maxent Lagrangian, for some concave nowhere-decreasing 
function /(.). Intuitively, this will have the effect of coor- 
dinating the updates of the separate q z at the end of the 
block, in a way to help lower G. It does this by distorting 
G to accentuate those x’s with good values. The price paid 
for this is that there will be more variance in the values of 
f(G) returned by the Monte Carlo sampling than those of 
<3, in general. 

Note that if q is a local minimum of the Lagrangian for G, 
in general it will not be a local minimum for the Lagrangian 
of f(G ) (the gradient will no longer be zero under that 
replacement, in general). So we can replace G with /(G) 
when we get stuck in a local minimum, and then return to G 



once q gets away from that local minimum. In this way we 
can break out of local minima, without facing the penalty 
of extra variance. Of course, none of these advantages in 
replacing G with /(G) hold for algorithms that directly 
search for an x giving a good G{x) value; x is a local 
minimum of G(x) x is a local minimum of f(G(x)). 

An even simpler modification to the E(G ) term than 
those considered above is to replace G(x) with 0[G(x) — 
K]. Under this replacement the E(G) term becomes the 
probability that G(x) > K . So minimizing it will push q to 
x with lower G values. For this modi£ed Lagrangian, the 
gradient descent update steps adds the following to each 

ot [pq(Lr < K | xj) + \n\q t {x l )') 

Ex; Pq( G < K \ x 'i) + ln (gt( x D) 
Ex; 1 

In gradient descent of the Maxent Lagrangian we must 
Monte Carlo estimate the expected value of a real number 
( G ). In contrast, in gradient descent of this modi£ed La- 
grangian we Monte Carlo estimate the expected value of a 
single bit: whether G exceeds K. Accordingly, the noise in 
the Monte Carlo estimation for this modi£ed Lagrangian is 
usually far smaller. 

In all these variants it may make sense to replace the 
Heaviside function with a logistic function or an expo- 
nential. In addition, in all of them the annealing schedule 
for K can be set by periodically searching for the K 
that is (estimated to be) optimal, just as one searches 
for optimal coordinate systems [14], [11]. Alternatively, a 
simple heuristic is to have K at the end of each block be 
set so that some pre-£xed percentage of the sampled points 
in the block go into our calculation of how to update q . 

Yet another possibility is to replace E(G) with the 
fifth percentile G value, i.e., with the K such that 
f dx f q(x')0(G(x f ) — K) = k. (To evaluate the partial 
derivative of that k with respect a particular qi(xi) one must 
use implicit differentiation.) 

B. KL-based Lagrangiarts 

Both the qp-KL Lagrangian and pq-KL Lagrangians 
discussed above had the target distribution be a Boltzmann 
distribution over G. For high enough /3, such a distribution 
is peaked near argmin x G(a;). So sampling an accurate 
approximation to it should give an x with low G, if f3 
is large enough. This is why one way to minimize G is 
to iteratively £nd a q that approximates the Boltzmann 
distribution, for higher and higher (3. 

However there are other target distributions that are 
peaked about minimizers of G. In particular, given any 
distribution the distribution 

= p(x)Q[K-G(x)] 

p[ > ~ f dx' p(x')Q[K — G(x')] 

is guaranteed to be more peaked about such minimizers 
than is p . So our minimization can be done by iterating 


the process of £nding the q that best approximates 6 P and 
then setting p = q. This is analogous to the minimization 
algorithm considered in previous sections, which iterates 
the process of £nding the q that best approximates the 
Boltzmann distribution and then increases (3. 

For the choice of pq-KL distance as the approximation 
error, the q that best approximates 0 P is just the product 
of the marginal distributions of 0 p . So at the end of each 
iteration, we replace 

/ & (t)g(i)( x (i))e[-fr - G{xi,x' {i) )\ 
f dx ! q f (x f )Q[K — G(x , )\ 
q'(G <K \ Xi ) 
q'(G < K) 
q\Xj \ G < K) 
q'(xi) 

where q f is the product distribution being replaced. The 
denominator term in this expression is known exactly to 
agent z, and the numerator can be Monte-Carlo estimated 
by that agent using only observed G values. So like gradient 
descent on the Maxent Lagrangian, this update rule is well- 
suited to a distributed implementation. 

Note that if we replace the Heaviside function in this 
algorithm with an exponential with exponent (3 , the update 
rule becomes 


«- 

]• 


qi(xi) 


E(e~P G | Xi) 
E(e~P G ) * 


where both expectations are evaluated under q l , the distribu- 
tion that generated the Monte Carlo samples. It’s interesting 
to compare this update rule with the parallel Brouwer update 
rule for the team game [19], [11], [12], to which it is 
very similar. 8 This update is guaranteed to optimize the 
associated Lagrangian, unlike the Brouwer update. On the 
other hand, since it is based on the pq-KL Lagrangian, 
as mentioned above there is no formal guarantee that this 
alternative to Brouwer updating will decrease E(G). 

This update rule is also very similar to the adaptive 
importance sampling of the original pq-KL approach dis- 
cussed in [11]. The difference is that in adaptive importance 
sampling the e~^ G ^ terms get replaced by / q* {x). 

Finally, consider using qp-KL distance to approximate 

* 0 <G(#)) 

q'( x ) j dx , rather than M-KL distance. In 

the Lagrangian for that distance the q f terms only contribute 
an overall additive constant. Aside from that constant, this 
qp-KL Lagrangian is identical to the Maxent Lagrangian. 


v. Conclusion 

Many problems in adaptive, distributed control can be 
cast as an iterated game. The coupling between the mixed 
strategies of the players arises as the system evolves from 
one instant to the next. This is what the system designer 

8 That update is a variant of £cticious play, in which we simultaneously 
replace each qi{xi) with its ideal value if were to be held £xed, given 
by Eq. 2. 



determines. Information theory tells us that the most likely 
joint strategy of the players, given a value of the expectation 
of the overall control objective function, is the minimizer 
of a particular Lagrangian function of the joint strategy. So 
the goal of the system designer is to speed evolution of the 
joint strategy to that Lagrangian minimizing point, lower 
the expectated value of the control objective function, and 
repeat. Here we elaborate the theory of algorithms that do 
this using local descent procedures, and that thereby achieve 
effcient, adaptive, distributed control. 

VI. Appendix 

This appendix provides proofs absent from the main text 


Since there must some x " such that qi{x") f 0, 3 Xi 
such that 0E(gi | x") 4- ln(g;(x")) is £nite. Therefore our 
component is negative in£nite. So L can be reduced by 
increasing qi{x[). Accordingly, no q having zero probability 
for some joint move x can be a minimum of i’s Lagrangian. 

ii) To construct a bounded rational game with multiple equi- 
libria, note that at any (necessarily interior) local minimum 
g, for each i, 

(3E{gi | x { ) + ln(fc(xi)) = 

P / dx(i)9i{xi,X(i))Y[qj{xj) + ln (gi(x < )) 

J 


A. Derivation of Lemma 1 

Proof: Consider the set of u such that the directional 
derivatives evaluated at x' all equal 0. These are the 
directions consistent with our constraints to £rst order. We 
need to £nd the one of those u such that D^V evaluated at 
x' is maximal. 

To simplify the analysis we introduce the constraint that 
[u| = 1. This means that the directional derivative D$V 
for any function V is just u - W. We then use Lagrange 
parameters to solve our problem. Our constraints on u are 
J2j u j ~ 1 Dufi(x') = u * V/i(x') = 0 Vi. Our 
objective function is D^V{x r ) = u • W(x'). 
Differentiating the Lagrangian gives 

2AoUi + ^A iV/ = W Vz. 

t 

with solution 

f , W-E.A.V/ 

* 2Ao 

A 0 enforces our constraint on |u|. Since we are only 
interested in specifying u up to a proportionality constant, 
we can set 2A 0 — 1. Rede£ning the Lagrange parameters 
by multiplying them by —1 then gives the result claimed. 

QED. 

B. Proof of claims following Lemma 1 

For generality, we provide the proofs in the general 
scenario where the private utilities g x may differ from one 
another. See the discussion in Section H[-C. 
i) De£ne ffq) = J dxiqi(xi), i.e., f is the constraint 
forcing q { to be normalized. Now for any q that equals 
zero for some joint move there must be an i and an x\ such 
that <Zi(x') = 0. Plugging into Lemma 1, we can evaluate 
the component of the direction of steepest descent along the 
direction of player V s probability of making move x': 

PL dfj 

dqi{xi) dqi{xi) 

pE{g i \x i ) + ]n(q i {x i )) 

[dx^[PE{gi j x") + ln(g t (x"))] 

fdx'/l 


must be independent of x^ by Lemma 1. So say 
there is a component-by-component bijection T(x) = 
(r 1 (x I ),T 2 (x 2 ),-..) that leaves all the { gj } unchanged, 
i.e., such that 9j (x) = gj{T(x)) Vx,j 9 . 

De£ne q' by q'(x) = q(T(x)) Vx. Then for any two 
values x\ and x?, 

PEqpgi I 4) + ln( 0 i(*i)) 

— dEn'igi I xf) 4- 1 n(< 7 -(xf)) 

P l d»«)*(*i»*w)II^^*iW + ln(9i(T(xJ))) 

J 

-pl dx (i) 5i(x?,x (i) ) JJgj(T(xj))) + ln(ft(r(x?))) 

J _ &i 

P [ dx(i)gi(xl iT 1 (xa))) TT Qi(xi) + ln(o,(r(x£ ))) 

' ' 

-P f d*(i)»<(^. r_ 1 (x(i))) n^'( x i)) + to («(r(af))) 

p [ *(«))) + bi (^( :r ( :r i))) 

j m 

- P f <to(oft( r ( a ^)»*w))II^(*j)) + to («(r(®?))) 

J _ 

PE q { 9i | T{x\)) + InfeCnx, 1 ))) 

- t 1E q ( 9i | T(x 2 i)) + ln( 9i (T(xf))) 

where the invariance of g x was used in the penultimate step. 

Since q is a local minimum though, this last difference must 
equal 0. Therefore q t is also a local minimum. 

Now choose the game so that Vi,Xi,T(xi) x*. (Our 
congestion game example has this property.) Then the only 
way the transformation q — ► q(T) can avoiding producing a 
new product distribution is if qt{xi) = <?i(x^) Vz,Xi,x', i.e., 

9 As an example, consider a congestion team game. In such a game all 
players have the same set of possible moves, and the shared utility G is a 
function only of the fc-indexed bit string {N(x, /:)}, where N(x, fc) = 1 
iff there is a move that is shared by exactly k of the players when the 
joint move is x. In this case T just permutes the set of possible moves in 
the same way for all players. 


g is uniform. Say the Hessians of the players’ Lagrangians 
are not all positive de£nite at the uniform q. (For example 
have our congestion game be biased away from uniform 
multiplicities.) Then that q is not a local minimum of the 
Lagrangians. Therefore at a local minimum, q ^ g(T). 
Accordingly, q and q(T) are two distinct equilibria. 


iii) To establish that at any q there is always a direction 
along which any player’s Lagrangian is locally convex, £x 
all but two of the {^}, go and g x , and £x both go and g : 
for all but two of their respective possible values, which we 
can write as g 0 (0), go(l), gi(0), and gi(l), respectively. So 
we can parameterize the set of q we’re considering by two 
real numbers, x = go(0) and y = gi(0). The 2x2 Hessian 
of L as a function of x and y is 



a \ 
I a — L_ ) 

y b—y / 


where a = 1 — go(0) — go(l) and b = 1 — gi(0) — g x (l), and 
a is a function of gi and Y[&o t i De ^ nin S 5 = £ + 
and t = ^ the eigenvalues of that Hessian are 


s + t± i y 4a 2 -f (s - t ) 2 
2 ’ 

The eigenvalue for the positive root is necessarily positive. 
Therefore along the corresponding eigenvector, L is convex 
at q. QED. 


iv) There are several ways to show that the value of 
E q 0 ([gi\i tq(i) ) must shrink as /? grows. Here we do so by 
evaluating the associated derivative with respect to /3. 

De£ne N(U) = f dy e~ u ( y \ the normalization constant 
for the distribution proportional to e~ u ^ y \ View the x 
indexed vector gf as a function of /3, g,. and q^y So we can 
somewhat inelegantly write E{gi) = E q f(fr guq{i) ), q{i) (9i)' 
Then one can expand 

dE(si) &M.N(fi\9i\i, qw )) 
dp ~ dp 2 

= — Var([5j]i )9(i) ) 

where the variance is over possible sampled according 
to qf( Xi ). QED. 
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